Abstract


Relation classification is an important semantic processing task for which state-of-the-art systems still rely on costly handcrafted features. In this work we tackle the relation classification task using a convolutional neural network that performs classification by ranking (CR-CNN). We propose a new pairwise ranking loss function that makes it easy to reduce the impact of artificial classes. We perform experiments using the SemEval-2010 Task 8 dataset, which is designed for the task of classifying the relationship between two nominals marked in a sentence. Using CR-CNN, we outperform the state-of-the-art for this dataset and achieve an F1 of 84.1 without using any costly handcrafted features. Additionally, our experimental results show that: (1) our approach is more effective than a CNN followed by a softmax classifier; (2) omitting the representation of the artificial class Other improves both precision and recall; and (3) using only word embeddings as input features is enough to achieve state-of-the-art results if we consider only the text between the two target nominals.


1 Introduction 


Relation classification is an important Natural Language Processing (NLP) task which is normally used as an intermediate step in many complex NLP applications such as question answering and automatic knowledge base construction. Over the last decade there has been increasing interest in applying machine learning approaches to this task (Zhang, 2004; Qian et al., 2009; Rink and Harabagiu, 2010). One reason is the availability of benchmark datasets such as the SemEval-2010 Task 8 dataset (Hendrickx et al., 2010), which encodes the task of classifying the relationship between two nominals marked in a sentence. The following sentence contains an example of the Component-Whole relation between the nominals “introduction” and “book”.

The [introduction]e1 in the [book]e2 is a summary of what is in the text.

Some recent work on relation classification has focused on the use of deep neural networks with the aim of reducing the number of handcrafted features (Socher et al., 2012; Zeng et al., 2014; Yu et al., 2014). However, in order to achieve state-of-the-art results these approaches still use some features derived from lexical resources such as WordNet or NLP tools such as dependency parsers and named entity recognizers (NER).

In this work, we propose a new convolutional neural network (CNN), which we name Classification by Ranking CNN (CR-CNN), to tackle the relation classification task. The proposed network learns a distributed vector representation for each relation class. Given an input text segment, the network uses a convolutional layer to produce a distributed vector representation of the text and compares it to the class representations in order to produce a score for each class. We propose a new pairwise ranking loss function that makes it easy to reduce the impact of artificial classes. We perform an extensive number of experiments using the SemEval-2010 Task 8 dataset. Using CR-CNN, and without the need for any costly handcrafted feature, we outperform the state-of-the-art for this dataset. Our experimental results are evidence that: (1) CR-CNN is more effective than a CNN followed by a softmax classifier; (2) omitting the representation of the artificial class Other improves both precision and recall; and (3) using only word embeddings as input features is enough to achieve state-of-the-art results if we consider only the text between the two target nominals.
















The remainder of the paper is structured as follows. Section 2 details the proposed neural network. In Section 3 we present details about the setup of experimental evaluation, and then describe the results in Section 4. In Section 5 we discuss previous work in deep neural networks for relation classification and for other NLP tasks. Section 6 presents our conclusions.

2 The Proposed Neural Network 

Given a sentence $x$ and two target nouns, CR-CNN computes a score for each relation class $c \in C$. For each class $c \in C$, the network learns a distributed vector representation which is encoded as a column in the class embedding matrix $W^{classes}$. As detailed in Figure 1, the only input for the network is the tokenized text string of the sentence. In the first step, CR-CNN transforms words into real-valued feature vectors. Next, a convolutional layer is used to construct a distributed vector representation of the sentence, $r_x$. Finally, CR-CNN computes a score for each relation class $c \in C$ by performing a dot product between $r_x^{\top}$ and $W^{classes}$.

2.1 Word Embeddings 

The first layer of the network transforms words into representations that capture syntactic and semantic information about the words. Given a sentence $x$ consisting of $N$ words $x = \{w_1, w_2, ..., w_N\}$, every word $w_n$ is converted into a real-valued vector $r^{w_n}$. Therefore, the input to the next layer is a sequence of real-valued vectors $emb_x = \{r^{w_1}, r^{w_2}, ..., r^{w_N}\}$.

Word representations are encoded by column vectors in an embedding matrix $W^{wrd} \in \mathbb{R}^{d^w \times |V|}$, where $V$ is a fixed-sized vocabulary. Each column $W^{wrd}_i \in \mathbb{R}^{d^w}$ corresponds to the word embedding of the $i$-th word in the vocabulary. We transform a word $w$ into its word embedding $r^w$ by using the matrix-vector product:

$$r^w = W^{wrd} v^w$$

where $v^w$ is a vector of size $|V|$ which has value 1 at index $w$ and zero in all other positions. The matrix $W^{wrd}$ is a parameter to be learned, and the size of the word embedding $d^w$ is a hyperparameter to be chosen by the user.
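
The matrix-vector product with a one-hot vector simply selects one column of the embedding matrix. The following plain-Python sketch illustrates this on a toy matrix; the function name `embed_word` and all numeric values are made up for illustration and are not part of the original system.

```python
def embed_word(W_wrd, word_index):
    """Return r^w = W^wrd v^w, where v^w is one-hot at word_index.

    W_wrd is stored as a list of d^w rows, each with |V| entries, so that
    its columns are the word embeddings.
    """
    vocab_size = len(W_wrd[0])
    v = [1.0 if i == word_index else 0.0 for i in range(vocab_size)]
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W_wrd]


# Toy matrix with d^w = 2 and |V| = 3 (illustrative values only).
W_wrd = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
]
r_w = embed_word(W_wrd, 1)
column = [row[1] for row in W_wrd]  # direct column lookup gives the same vector
```

In practice the product is never materialized; a direct column lookup is equivalent and much cheaper.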

2.2 Word Position Embeddings 

In the task of relation classification, information that is needed to determine the class of a relation between two target nouns normally comes from words which are close to the target nouns. Zeng et al. (2014) propose the use of word position embeddings (position features) which help the CNN by keeping track of how close words are to the target nouns. These features are similar to the position features proposed by Collobert et al. (2011) for the Semantic Role Labeling task.

[Figure 1: CR-CNN: a Neural Network for classifying by ranking. The output layer computes $s_\theta(x) = r_x^{\top} W^{classes}$.]

In this work we also experiment with the word position embeddings (WPE) proposed by Zeng et al. (2014). The WPE is derived from the relative distances of the current word to the targets $noun_1$ and $noun_2$. For instance, in the sentence shown in Figure 1, the relative distances of left to car and plant are -1 and 2, respectively. As in (Collobert et al., 2011), each relative distance is mapped to a vector of dimension $d^{wpe}$, which is initialized with random numbers. $d^{wpe}$ is a hyperparameter of the network. Given the vectors $wp_1$ and $wp_2$ for the word $w$ with respect to the targets $noun_1$ and $noun_2$, the position embedding of $w$ is given by the concatenation of these two vectors, $wpe^w = [wp_1, wp_2]$.

In the experiments where word position embeddings are used, the word embedding and the word position embedding of each word are concatenated to form the input for the convolutional layer, $emb_x = \{[r^{w_1}, wpe^{w_1}], [r^{w_2}, wpe^{w_2}], ..., [r^{w_N}, wpe^{w_N}]\}$.
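
A minimal sketch of how the position features could be computed is given below. The five-token sentence is an assumption chosen so that the distances of "left" match the -1 and 2 quoted above, the sign convention (target position minus word position) is likewise an assumption, and the lookup table `wpe_table` is a hypothetical stand-in for the randomly initialized, learned vectors.

```python
def relative_distances(num_tokens, e1_index, e2_index):
    """For each token position, return (distance to noun1, distance to noun2).

    The sign convention (target position minus word position) is an
    assumption chosen to reproduce the -1 and 2 quoted in the text.
    """
    return [(e1_index - n, e2_index - n) for n in range(num_tokens)]


def position_embedding(wpe_table, distances):
    """Concatenate the two distance vectors: wpe^w = [wp_1, wp_2]."""
    d1, d2 = distances
    return wpe_table[d1] + wpe_table[d2]


# Assumed example sentence; only the distances of "left" come from the text.
tokens = ["the", "car", "left", "the", "plant"]
dists = relative_distances(len(tokens), e1_index=1, e2_index=4)

# Hypothetical table with d^wpe = 2; real vectors are random and learned.
wpe_table = {d: [float(d), float(d) / 10.0] for d in range(-4, 5)}
wpe_left = position_embedding(wpe_table, dists[2])
```

Each word then contributes $d^w + 2 d^{wpe}$ values to the input of the convolutional layer.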

2.3 Sentence Representation 


The next step in the NN consists in creating the distributed vector representation $r_x$ for the input sentence $x$. The main challenges in this step are the sentence size variability and the fact that important information can appear at any position in the sentence. In recent work, convolutional approaches have been used to tackle these issues when creating representations for text segments of different sizes (Zeng et al., 2014; Hu et al., 2014; dos Santos and Gatti, 2014) and character-level representations of words of different sizes (dos Santos and Zadrozny, 2014). Here, we use a convolutional layer to compute distributed vector representations of the sentence. The convolutional layer first produces local features around each word in the sentence. Then, it combines these local features using a max operation to create a fixed-sized vector for the input sentence.

Given a sentence $x$, the convolutional layer applies a matrix-vector operation to each window of size $k$ of successive windows in $emb_x = \{r^{w_1}, r^{w_2}, ..., r^{w_N}\}$. Let us define the vector $z_n \in \mathbb{R}^{d^w k}$ as the concatenation of a sequence of $k$ word embeddings, centralized in the $n$-th word:

$$z_n = (r^{w_{n-(k-1)/2}}, ..., r^{w_{n+(k-1)/2}})^{\top}$$

In order to overcome the issue of referencing words with indices outside of the sentence boundaries, we augment the sentence with a special padding token replicated $\frac{k-1}{2}$ times at the beginning and the end.
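
The window construction with padding can be sketched as follows. This is an illustration only: the function name `word_windows`, the all-zero padding vector and the toy embeddings are assumptions, and $k$ is assumed to be odd so that each window is centred on its word.

```python
def word_windows(emb_x, k, pad_vector):
    """Build z_n for every word: the concatenation of the k embeddings
    centred on position n, after padding (k - 1) / 2 vectors on each side.

    Assumes k is odd so that the window is symmetric.
    """
    half = (k - 1) // 2
    padded = [pad_vector] * half + list(emb_x) + [pad_vector] * half
    windows = []
    for n in range(len(emb_x)):
        z_n = []
        for vec in padded[n:n + k]:
            z_n.extend(vec)
        windows.append(z_n)
    return windows


# Three words with 2-dimensional toy embeddings and a window of size 3.
emb_x = [[1.0, 1.5], [2.0, 2.5], [3.0, 3.5]]
windows = word_windows(emb_x, k=3, pad_vector=[0.0, 0.0])
```

With padding in place, every word of the sentence yields exactly one window, so a sentence of $N$ words produces $N$ vectors $z_n$.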

The convolutional layer computes the $j$-th element of the vector $r_x \in \mathbb{R}^{d^c}$ as follows:

$$[r_x]_j = \max_{1 \le n \le N} [f(W^1 z_n + b^1)]_j$$

where $W^1 \in \mathbb{R}^{d^c \times d^w k}$ is the weight matrix of the convolutional layer and $f$ is the hyperbolic tangent function. The same matrix is used to extract local features around each word window of the given sentence. The fixed-sized distributed vector representation for the sentence is obtained by using the max over all word windows. Matrix $W^1$ and vector $b^1$ are parameters to be learned. The number of convolutional units $d^c$, and the size of the word context window $k$ are hyperparameters to be chosen by the user. It is important to note that $d^c$ corresponds to the size of the sentence representation.


2.4 Class embeddings and Scoring

Given the distributed vector representation of the input sentence $x$, the network with parameter set $\theta$ computes the score for a class label $c \in C$ by using the dot product

$$s_\theta(x)_c = r_x^{\top} [W^{classes}]_c$$

where $W^{classes}$ is an embedding matrix whose columns encode the distributed vector representations of the different class labels, and $[W^{classes}]_c$ is the column vector that contains the embedding of the class $c$. Note that the number of dimensions in each class embedding must be equal to the size of the sentence representation, which is defined by $d^c$. The embedding matrix $W^{classes}$ is a parameter to be learned by the network. It is initialized by randomly sampling each value from a uniform distribution: $U(-r, r)$, where $r = \sqrt{\frac{6}{|C| + d^c}}$.


2.5 Training Procedure

Our network is trained by minimizing a pairwise ranking loss function over the training set $D$. The input for each training round is a sentence $x$ and two different class labels $y^+ \in C$ and $c^- \in C$, where $y^+$ is a correct class label for $x$ and $c^-$ is not. Let $s_\theta(x)_{y^+}$ and $s_\theta(x)_{c^-}$ be respectively the scores for class labels $y^+$ and $c^-$ generated by the network with parameter set $\theta$. We propose a new logistic loss function over these scores in order to train CR-CNN:

$$L = \log(1 + \exp(\gamma(m^+ - s_\theta(x)_{y^+}))) + \log(1 + \exp(\gamma(m^- + s_\theta(x)_{c^-}))) \qquad (1)$$

where $m^+$ and $m^-$ are margins and $\gamma$ is a scaling factor that magnifies the difference between the score and the margin and helps to penalize more on the prediction errors. The first term in the right side of Equation 1 decreases as the score $s_\theta(x)_{y^+}$ increases. The second term in the right side decreases as the score $s_\theta(x)_{c^-}$ decreases.
Training CR-CNN by minimizing the loss function in Equation 1 has the effect of training to give scores greater than $m^+$ for the correct class and (negative) scores smaller than $-m^-$ for incorrect classes. In our experiments we set $\gamma$ to 2, $m^+$ to 2.5 and $m^-$ to 0.5. We use L2 regularization by adding the term $\beta \|\theta\|^2$ to Equation 1. In our experiments we set $\beta$ to 0.001. We use stochastic gradient descent (SGD) to minimize the loss function with respect to $\theta$.
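
As a hedged illustration of the loss in Equation 1, the sketch below evaluates it for a single training pair in plain Python. The function names are hypothetical; the default values of the scaling factor, the two margins and the regularization weight are the ones reported above, and the example scores are made up.

```python
import math


def ranking_loss(score_pos, score_neg, gamma=2.0, m_pos=2.5, m_neg=0.5):
    """Pairwise logistic ranking loss of Equation 1 for one training pair.

    score_pos is the score of the correct class y+, score_neg the score of
    the chosen incorrect class c-. Defaults follow the values in the text.
    """
    pos_term = math.log1p(math.exp(gamma * (m_pos - score_pos)))
    neg_term = math.log1p(math.exp(gamma * (m_neg + score_neg)))
    return pos_term + neg_term


def l2_penalty(parameters, beta=0.001):
    """beta * ||theta||^2 over a flat list of parameter values."""
    return beta * sum(p * p for p in parameters)


# Made-up scores: a well-separated pair and a confused pair.
well_separated = ranking_loss(score_pos=4.0, score_neg=-2.0)
confused = ranking_loss(score_pos=0.0, score_neg=1.0)
```

The loss is small once the correct class scores above the positive margin and the incorrect class scores below the negated negative margin, and it grows roughly linearly as either score moves to the wrong side.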

Like some other ranking approaches that only update two classes/examples at every training round (Weston et al., 2011; Gao et al., 2014), we can efficiently train the network for tasks which have a very large number of classes. This is an advantage over softmax classifiers.

On the other hand, sampling informative negative classes/examples can have a significant impact on the effectiveness of the learned model. In the case of our loss function, more informative negative classes are the ones with a score larger than $-m^-$. The number of classes in the relation classification dataset that we use in our experiments is small. Therefore, in our experiments, given a sentence $x$ with class label $y^+$, the incorrect class $c^-$ that we choose to perform an SGD step is the one with the highest score among all incorrect classes:

$$c^- = \arg\max_{c \in C;\ c \neq y^+} s_\theta(x)_c$$

For tasks where the number of classes is large, we can fix a number of negative classes to be considered at each example and select the one with the largest score to perform a gradient step. This approach is similar to the one used by Weston et al. (2014) to select negative examples.
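
Both ways of choosing the negative class can be sketched in a few lines. The function names, the use of a dictionary of scores and the uniform sampling of candidates are assumptions made for illustration; the scores are made up.

```python
import random


def hardest_negative(scores, correct_class):
    """Return the incorrect class with the highest score.

    scores maps each class label to s_theta(x)_c.
    """
    candidates = {c: s for c, s in scores.items() if c != correct_class}
    return max(candidates, key=candidates.get)


def sampled_hardest_negative(scores, correct_class, num_samples, rng):
    """Variant for large label sets: draw a fixed number of incorrect
    classes and keep the best-scoring one among them."""
    incorrect = [c for c in scores if c != correct_class]
    sample = rng.sample(incorrect, min(num_samples, len(incorrect)))
    return max(sample, key=lambda c: scores[c])


# Made-up scores for three of the SemEval-2010 relation labels.
scores = {"Cause-Effect": 1.2, "Entity-Origin": 0.4, "Message-Topic": -0.8}
negative = hardest_negative(scores, correct_class="Cause-Effect")
sampled = sampled_hardest_negative(
    scores, "Cause-Effect", num_samples=2, rng=random.Random(0)
)
```

The exhaustive variant costs one pass over all class scores, which is affordable when the label set is as small as in SemEval-2010 Task 8.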

We use the backpropagation algorithm to compute gradients of the network. In our experiments, we implement the CR-CNN architecture and the backpropagation algorithm using Theano (Bergstra et al., 2010).


2.6 Special Treatment of Artificial Classes 

In this work, we consider a class as artificial if it is used to group items that do not belong to any of the actual classes. An example of an artificial class is the class Other in the SemEval-2010 relation classification task. In this task, the artificial class Other is used to indicate that the relation between two nominals does not belong to any of the nine relation classes of interest. Therefore, the class Other is very noisy since it groups many different types of relations that may not have much in common.

An important characteristic of CR-CNN is that it makes it easy to reduce the effect of artificial classes by omitting their embeddings. If the embedding of a class label $c$ is omitted, it means that the embedding matrix $W^{classes}$ does not contain a column vector for $c$. One of the main benefits from this strategy is that the learning process focuses on the “natural” classes only. Since the embedding of the artificial class is omitted, it will not influence the prediction step, i.e., CR-CNN does not produce a score for the artificial class.

In our experiments with the SemEval-2010 relation classification task, when training with a sentence $x$ whose class label $y$ = Other, the first term in the right side of Equation 1 is set to zero. During prediction time, a relation is classified as Other only if all actual classes have negative scores. Otherwise, it is classified with the class which has the largest score.
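
The prediction rule for a model without an embedding for Other can be sketched as follows. The function name and the example scores are hypothetical; a score of exactly zero is treated as non-negative, which is one reading of the rule above.

```python
def predict_relation(scores, artificial_label="Other"):
    """Apply the prediction rule for a model without an Other embedding.

    scores maps each natural class to its score. The artificial label is
    returned only when every natural class has a negative score.
    """
    best_class = max(scores, key=scores.get)
    if scores[best_class] < 0:
        return artificial_label
    return best_class


# Made-up scores for illustration.
confident = predict_relation({"Cause-Effect": 1.7, "Entity-Origin": -0.3})
rejected = predict_relation({"Cause-Effect": -0.9, "Entity-Origin": -0.3})
```

Because only the largest score needs to be inspected, checking that all natural classes are negative reduces to checking the sign of the maximum.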


3 Experimental Setup 


3.1 Dataset and Evaluation Metric 


We use the SemEval-2010 Task 8 dataset to perform our experiments. This dataset contains 10,717 examples annotated with 9 different relation types and an artificial relation Other, which is used to indicate that the relation in the example does not belong to any of the nine main relation types. The nine relations are Cause-Effect, Component-Whole, Content-Container, Entity-Destination, Entity-Origin, Instrument-Agency, Member-Collection, Message-Topic and Product-Producer. Each example contains a sentence marked with two nominals e1 and e2, and the task consists of predicting the relation between the two nominals taking into consideration the directionality. That means that the relation Cause-Effect(e1,e2) is different from the relation Cause-Effect(e2,e1), as shown in the examples below. More information about this dataset can be found in (Hendrickx et al., 2010).

The [war]e1 resulted in other collateral imperial [conquests]e2 as well. => Cause-Effect(e1,e2)

The [burst]e1 has been caused by water hammer [pressure]e2. => Cause-Effect(e2,e1)


The SemEval-2010 Task 8 dataset is already partitioned into 8,000 training instances and 2,717 test instances. We score our systems by using the SemEval-2010 Task 8 official scorer, which computes the macro-averaged F1-scores for the nine actual relations (excluding Other) and takes the directionality into consideration.


3.2 Word Embeddings Initialization 


The word embeddings used in our experiments are initialized by means of unsupervised pre-training. We perform pre-training using the skip-gram NN architecture (Mikolov et al., 2013) available in the word2vec tool. We use the December 2013 snapshot of the English Wikipedia corpus to train word embeddings with word2vec. We preprocess the Wikipedia text using the steps described in (dos Santos and Gatti, 2014): (1) removal of paragraphs that are not in English; (2) substitution of non-western characters for a special character; (3) tokenization of the text using the tokenizer available with the Stanford POS Tagger (Toutanova et al., 2003); (4) removal of sentences that are less than 20 characters long (including white spaces) or have less than 5 tokens; (5) lowercasing of all words and substitution of each numerical digit by a 0. The resulting clean corpus contains about 1.75 billion tokens.
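
Steps (4) and (5) of the preprocessing are simple enough to sketch directly. The helper names and the example sentence below are made up, and whitespace splitting stands in for the Stanford tokenizer, so this is an illustration of the filtering and normalization rules rather than the pipeline actually used.

```python
import re


def keep_sentence(sentence, tokens):
    """Step (4): drop sentences shorter than 20 characters (white space
    included) or with fewer than 5 tokens."""
    return len(sentence) >= 20 and len(tokens) >= 5


def normalize_token(token):
    """Step (5): lowercase the token and replace every digit with 0."""
    return re.sub(r"[0-9]", "0", token.lower())


# Whitespace splitting stands in for the Stanford tokenizer here.
sentence = "In 2013 the Corpus had 1750 Articles"
tokens = sentence.split()
if keep_sentence(sentence, tokens):
    normalized = [normalize_token(t) for t in tokens]
else:
    normalized = []
```

Mapping all digits to 0 collapses numbers of the same length into one vocabulary entry, which keeps the vocabulary small.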


3.3 Neural Network Hyper-parameter 

We use 4-fold cross-validation to tune the neural network hyperparameters. Learning rates between 0.01 and 0.03 give relatively similar results. Best results are achieved using between 10 and 15 training epochs, depending on the CR-CNN configuration. In Table 1 we show the selected hyperparameter values. Additionally, we use a learning rate schedule that decreases the learning rate $\lambda$ according to the training epoch $t$. The learning rate for epoch $t$, $\lambda_t$, is computed using the equation: $\lambda_t = \frac{\lambda}{t}$.
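
As a small worked example, and assuming the schedule divides the initial learning rate by the epoch index ($\lambda_t = \lambda / t$ with $t$ starting at 1), the rates for the first epochs can be computed as follows. The function name is hypothetical; the initial rate of 0.025 is the value listed in Table 1.

```python
def learning_rate(initial_rate, epoch):
    """lambda_t = lambda / t for a 1-based epoch index t (assumed schedule)."""
    if epoch < 1:
        raise ValueError("epoch index starts at 1")
    return initial_rate / epoch


# Rates for the first five epochs with the initial rate from Table 1.
schedule = [learning_rate(0.025, t) for t in range(1, 6)]
```

Under this assumption the rate halves after the first epoch and has dropped by an order of magnitude by epoch 10.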


4 Experimental Results 

4.1 Word Position Embeddings and Input Text Span


In the experiments discussed in this section we assess the impact of using word position embeddings (WPE) and also propose a simpler alternative approach that is almost as effective as WPEs. The main idea behind the use of WPEs in the relation classification task is to give some hint to the convolutional layer of how close a word is to the target nouns, based on the assumption that closer words have more impact than distant words.

Here we hypothesize that most of the information needed to classify the relation appears between the two target nouns. Based on this hypothesis, we perform an experiment where the input for the convolutional layer consists of the word embeddings of the word sequence $\{w_{e_1 - 1}, ..., w_{e_2 + 1}\}$, where $e_1$ and $e_2$ correspond to the positions of the first and the second target nouns, respectively.
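
Extracting this text span amounts to a slice over the token list. The sketch below uses the example sentence from the introduction with whitespace tokenization; the function name is hypothetical, and clipping the span at the sentence boundaries is an assumption about how targets at the first or last position would be handled.

```python
def target_span(tokens, e1_index, e2_index):
    """Return the tokens from position e1 - 1 to e2 + 1, clipped to the
    sentence boundaries (the clipping is an assumption)."""
    start = max(e1_index - 1, 0)
    end = min(e2_index + 1, len(tokens) - 1)
    return tokens[start:end + 1]


tokens = "The introduction in the book is a summary of what is in the text .".split()
# "introduction" is at index 1 and "book" at index 4.
span = target_span(tokens, e1_index=1, e2_index=4)
```

For this sentence the span keeps six of the fifteen tokens, which illustrates how much shorter the input to the convolutional layer becomes.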

In Table 2 we compare the results of different CR-CNN configurations. The first column indicates whether the full sentence was used (Yes) or whether the text span between the target nouns was used (No). The second column indicates whether the WPEs were used. It is clear that the use of WPEs is essential when the full sentence is used, since F1 jumps from 74.3 to 84.1. This effect of WPEs is reported by (Zeng et al., 2014). On the other hand, when using only the text span between the target nouns, the impact of WPE is much smaller. With this strategy, we achieve an F1 of 82.8 using only word embeddings as input, which is a result as good as the previous state-of-the-art F1 of 83.0 reported in (Yu et al., 2014) for the SemEval-2010 Task 8 dataset. This experimental result also suggests that, in this task, the CNN works better for short texts.

All experiments reported in the next sections use CR-CNN with full sentence and WPEs.


Parameter    Parameter Name          Value
$d^w$        Word Emb. size          400
$d^{wpe}$    Word Pos. Emb. size     70
$d^c$        Convolutional Units     1000
$k$          Context Window size     3
$\lambda$    Initial Learning Rate   0.025

Table 1: CR-CNN Hyperparameters


Full Sentence   Word Position   Prec.   Rec.   F1
Yes             Yes             83.7    84.7   84.1
No              Yes             83.3    83.9   83.5
No              No              83.4    82.3   82.8
Yes             No              78.1    71.5   74.3

Table 2: Comparison of different CR-CNN configurations.
















4.2 Impact of Omitting the Embedding of the artificial class Other

In this experiment we assess the impact of omitting the embedding of the class Other. As we mentioned above, this class is very noisy since it groups many different infrequent relation types. Its embedding is difficult to define and therefore brings noise into the classification process of the natural classes. In Table 3 we present the results comparing the use and omission of embedding for the class Other. The first two lines of results present the official F1, which does not take into account the results for the class Other. We can see that by omitting the embedding of the class Other both precision and recall for the other classes improve, which results in an increase of 1.4 in the F1. These results suggest that the strategy we use in CR-CNN to avoid the noise of artificial classes is effective.


Use embedding of class Other   Class   Prec.   Rec.   F1
No                             All     83.7    84.7   84.1
Yes                            All     81.3    84.3   82.7
No                             Other   52.0    48.7   50.3
Yes                            Other   60.1    48.7   53.8

Table 3: Impact of not using an embedding for the artificial class Other.


In the last two lines of Table 3 we present the results for the class Other. We can note that while the recall for the cases classified as Other remains 48.7, the precision significantly decreases from 60.1 to 52.0 when the embedding of the class Other is not used. That means that more cases from natural classes (All) are now being classified as Other. However, as both the precision and the recall of the natural classes increase, the cases that are now classified as Other must be cases that are also wrongly classified when the embedding of the class Other is used.

4.3 CR-CNN versus CNN+Softmax

In this section we report experimental results comparing CR-CNN with CNN+Softmax. In order to do a fair comparison, we have implemented a CNN+Softmax and trained it with the same data, word embeddings and WPEs used in CR-CNN. Concretely, our CNN+Softmax consists in getting the output of the convolutional layer, which is the vector $r_x$ in Figure 1, and giving it as input for a softmax classifier. We tune the parameters of CNN+Softmax by using a 4-fold cross-validation with the training set. Compared to the hyperparameter values for CR-CNN presented in Table 1, the only difference for CNN+Softmax is the number of convolutional units $d^c$, which is set to 400.

In Table 4 we compare the results of CR-CNN and CNN+Softmax. CR-CNN outperforms CNN+Softmax in both precision and recall, and improves the F1 by 1.6. The third line in Table 4 shows the result reported by Zeng et al. (2014) when only word embeddings and WPEs are used as input to the network (similar to our CNN+Softmax). We believe that the word embeddings employed by them are the main reason their result is much worse than that of CNN+Softmax. We use word embeddings of size 400 while they use word embeddings of size 50, which were trained using much less unlabeled data than we did.

Neural Net.                       Prec.   Rec.   F1
CR-CNN                            83.7    84.7   84.1
CNN+Softmax                       82.1    83.1   82.5
CNN+Softmax (Zeng et al., 2014)   -       -      78.9

Table 4: Comparison of results of CR-CNN and CNN+Softmax.

4.4 Comparison with the State-of-the-art 

In Table 5 we compare CR-CNN results with results recently published for the SemEval-2010 Task 8 dataset. Rink and Harabagiu (2010) present a support vector machine (SVM) classifier that is fed with a rich (traditional) feature set. It obtains an F1 of 82.2, which was the best result at SemEval-2010 Task 8. Socher et al. (2012) present results for a recursive neural network (RNN) that employs a matrix-vector representation to every node in a parse tree in order to compose the distributed vector representation for the complete sentence. Their method is named the matrix-vector recursive neural network (MVRNN) and achieves an F1 of 82.4 when POS, NER and WordNet features are used. In (Zeng et al., 2014), the authors present results for a CNN+Softmax classifier which employs lexical and sentence-level features. Their classifier achieves an F1 of 82.7 when adding a handcrafted feature based on WordNet. Yu et al. (2014) present the Factor-based Compositional Embedding Model (FCM), which achieves an F1 of 83.0 by deriving sentence-level and substructure embeddings from word embeddings utilizing dependency trees and named entities.

As we can see in the last line of Table 5, CR-CNN using the full sentence, word embeddings and WPEs outperforms all previously reported results and reaches a new state-of-the-art F1 of 84.1. This is a remarkable result since we do not use any complicated features that depend on external lexical resources such as WordNet and NLP tools such as named entity recognizers (NERs) and dependency parsers.

We can see in Table 5 that CR-CNN* also achieves the best result among the systems that use word embeddings as the only input features. The closest result (80.6), which is produced by the FCM system of Yu et al. (2014), is 2.2 F1 points behind the CR-CNN result (82.8).


4.5 Most Representative Trigrams for each Relation

In Table 6, for each relation type we present the five trigrams in the test set which contributed the most for scoring correctly classified examples. Remember that in CR-CNN, given a sentence $x$, the score for the class $c$ is computed by $s_\theta(x)_c = r_x^{\top} [W^{classes}]_c$. In order to compute the most representative trigram of a sentence $x$, we trace back each position in $r_x$ to find the trigram responsible for it. For each trigram $t$, we compute its particular contribution for the score by summing the terms in the score that use positions in $r_x$ that trace back to $t$. The most representative trigram in $x$ is the one with the largest contribution to the improvement of the score. In order to create the results presented in Table 6, we rank the trigrams which were selected as the most representative of any sentence in decreasing order of contribution value. If a trigram appears as the largest contributor for more than one sentence, its contribution value becomes the sum of its contribution for each sentence.
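
The trace-back can be sketched in plain Python as follows. The sketch assumes the max-pooled convolution with a hyperbolic tangent activation described in Section 2.3; the function names and all numeric values are made up, and each window index stands for the trigram centred on that word.

```python
import math


def sentence_vector_with_trace(windows, W1, b1):
    """Max-pooled convolution: [r_x]_j = max_n tanh(W1 z_n + b1)_j.

    Also returns, for each j, the window index n that produced the maximum.
    """
    r_x, argmax = [], []
    for row, bias in zip(W1, b1):
        activations = [
            math.tanh(sum(w * z for w, z in zip(row, z_n)) + bias)
            for z_n in windows
        ]
        best_n = max(range(len(windows)), key=activations.__getitem__)
        r_x.append(activations[best_n])
        argmax.append(best_n)
    return r_x, argmax


def trigram_contributions(r_x, argmax, class_embedding):
    """Sum the score terms [r_x]_j * [W^classes]_{j,c} per originating window."""
    contributions = {}
    for value, n, weight in zip(r_x, argmax, class_embedding):
        contributions[n] = contributions.get(n, 0.0) + value * weight
    return contributions


# Toy values: two windows z_n with 2 entries each and d^c = 3 units.
windows = [[1.0, 0.0], [0.0, 1.0]]
W1 = [[2.0, 0.0], [0.0, 2.0], [1.0, 0.5]]
b1 = [0.0, 0.0, 0.0]
class_embedding = [1.0, 0.5, 1.0]

r_x, argmax = sentence_vector_with_trace(windows, W1, b1)
contributions = trigram_contributions(r_x, argmax, class_embedding)
score = sum(v * w for v, w in zip(r_x, class_embedding))
top_window = max(contributions, key=contributions.get)
```

The per-window contributions add up to the class score, so the ranking of windows is a decomposition of the score rather than a separate heuristic.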

We can see in Table 6 that for most classes, the trigrams that contributed the most to increase the score are indeed very informative regarding the relation type. As expected, different trigrams play an important role depending on the direction of the relation. For instance, the most informative trigram for Entity-Origin(e1,e2) is “away from the”, while the reverse direction of the relation, Entity-Origin(e2,e1) or Origin-Entity, has “the source of” as the most informative trigram. These results are a step towards the extraction of meaningful knowledge from models produced by CNNs.

*This is the result using only the text span between the target nouns.


5 Related Work 


Over the years, various approaches have been proposed for relation classification (Zhang, 2004; Qian et al., 2009; Hendrickx et al., 2010; Rink and Harabagiu, 2010). Most of them treat it as a multi-class classification problem and apply a variety of machine learning techniques to the task in order to achieve a high accuracy.


Recently, deep learning (Bengio, 2009) has become an attractive area for multiple applications, including computer vision, speech recognition and natural language processing. Among the different deep learning strategies, convolutional neural networks have been successfully applied to different NLP tasks such as part-of-speech tagging (dos Santos and Zadrozny, 2014), sentiment analysis (Kim, 2014; dos Santos and Gatti, 2014), question classification (Kalchbrenner et al., 2014), semantic role labeling (Collobert et al., 2011), hashtag prediction (Weston et al., 2014), sentence completion and response matching (Hu et al., 2014).

Some recent work on deep learning for relation classification includes Socher et al. (2012), Zeng et al. (2014) and Yu et al. (2014). In (Socher et al., 2012), the authors tackle relation classification using a recursive neural network (RNN) that assigns a matrix-vector representation to every node in a parse tree. The representation for the complete sentence is computed bottom-up by recursively combining the words according to the syntactic structure of the parse tree. Their method is named the matrix-vector recursive neural network (MVRNN).


Zeng et al. (2014) propose an approach for relation classification where sentence-level features are learned through a CNN, which has word embedding and position features as its input. In parallel, lexical features are extracted according to given nouns. Then sentence-level and lexical features are concatenated into a single vector and fed into a softmax classifier for prediction. This approach achieves state-of-the-art performance on the SemEval-2010 Task 8 dataset.


Yu et al. (2014) propose a Factor-based Compositional Embedding Model (FCM) by deriving sentence-level and substructure embeddings from word embeddings, utilizing dependency trees and named entities. It achieves slightly higher accuracy on the same dataset than (Zeng et al., 2014), but only when syntactic information is used.


Classifier                   Feature Set                                             F1
SVM                          POS, prefixes, morphological, WordNet, dependency       82.2
(Rink and Harabagiu, 2010)   parse, Levin classes, PropBank, FrameNet,
                             NomLex-Plus, Google n-gram, paraphrases, TextRunner
RNN                          word embeddings                                         74.8
(Socher et al., 2012)        word embeddings, POS, NER, WordNet                      77.6
MVRNN                        word embeddings                                         79.1
(Socher et al., 2012)        word embeddings, POS, NER, WordNet                      82.4
CNN+Softmax                  word embeddings                                         69.7
(Zeng et al., 2014)          word embeddings, word position embeddings,              82.7
                             word pair, words around word pair, WordNet
FCM                          word embeddings                                         80.6
(Yu et al., 2014)            word embeddings, dependency parse, NER                  83.0
CR-CNN                       word embeddings                                         82.8
                             word embeddings, word position embeddings               84.1

Table 5: Comparison with results published in the literature.


Cause-Effect
  (e1,e2): e1 resulted in, e1 caused a, had caused the, poverty cause e2, caused a e2
  (e2,e1): e2 caused by, was caused by, are caused by, been caused by, e2 from e1
Component-Whole
  (e1,e2): e1 of the, of the e2, part of the, in the e2, e1 on the
  (e2,e1): e2’s e1, with its e1, e2 has a, e2 comprises the, e2 with e1
Content-Container
  (e1,e2): was in a, was hidden in, were in a, was inside a, was contained in
  (e2,e1): e2 full of, e2 with e1, e2 was full, e2 contained a, e2 with cold
Entity-Destination
  (e1,e2): e1 into the, e1 into a, e1 to the, was put inside, imported into the
  (e2,e1): -
Entity-Origin
  (e1,e2): away from the, derived from a, had left the, derived from an, e1 from the
  (e2,e1): the source of, e2 grape e1, e2 butter e1
Instrument-Agency
  (e1,e2): are used by, e1 for e2, is used by, trade for e2, with the e2
  (e2,e1): with a e1, by using e1, e2 finds a, e2 with a, e2 , who
Member-Collection
  (e1,e2): of the e2, in the e2, of this e2, the political e2, e1 collected in
  (e2,e1): e2 of e1, of wild e1, of elven e1, e2 of different, of 0000 e1
Message-Topic
  (e1,e2): e1 is the, e1 asserts the, e1 that the, on the e2, e1 inform about
  (e2,e1): described in the, discussed in the, featured in numerous, discussed in cabinet, documented in two
Product-Producer
  (e1,e2): e1 by the, by a e2, of the e2, by the e2, from the e2
  (e2,e1): e2 of the, e2 has constructed, e2’s e1, e2 came up, e2 who created

Table 6: List of most representative trigrams for each relation type.


There are two main differences between the approach proposed in this paper and the ones proposed in (Socher et al., 2012; Zeng et al., 2014; Yu et al., 2014): (1) CR-CNN uses a pair-wise ranking method, while other approaches apply multi-class classification by using the softmax function on the top of the CNN/RNN; and (2) CR-CNN employs an effective method to deal with artificial classes by omitting their embeddings, while other approaches treat all classes equally.


6 Conclusion 

In this work we tackle the relation classification task using a CNN that performs classification by ranking. The main contributions of this work are: (1) the definition of a new state-of-the-art for the SemEval-2010 Task 8 dataset without using any costly handcrafted features; (2) the proposal of a new CNN for classification that uses class embeddings and a new rank loss function; (3) an effective method to deal with artificial classes by omitting their embeddings in CR-CNN; (4) the demonstration that using only the text between target nominals is almost as effective as using WPEs; and (5) a method to extract from the CR-CNN model the most representative contexts of each relation type. Although we apply CR-CNN to relation classification, this method can be used for any classification task.